Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)


The Organization of Knowledge

“self-organization”.¹⁰ In actual practice the mining is not completely autonomous; the miner predefines classes onto which the data items will be mapped (supervised learning from data, also known as “intelligent data analysis”), just as a real miner generally knows what minerals he is seeking (but, to be sure, a good miner would be open to finding and extracting other minerals that might unexpectedly occur in the deposit). Typical tasks undertaken in practical data mining are

Supervised (directed) learning

Classification into the predefined classes;

Estimation: extracting a value for some variable from the data;

Prediction: classifying according to possible future behaviour; estimating a future value of the variable of interest;

Unsupervised (undirected) learning

Association rules (dependency modeling): determining which items belong together;

Clustering: grouping items according to their distance in some metric (cf. Sect. 13.2);

Description and visualization.

These tasks are in turn embedded in a wider framework, comprising

Data cleansing, a complex process that can be automated as far as internal inconsistencies are concerned, but which, at present at least, still requires human scrutiny of the laboratory methods used to acquire the data;

Integration; this might merely mean merging disparate databases into a common format;

Selection, in case the entire database will not be used; irrelevant information could be automatically eliminated during the main mining process, but it may save significant processing effort to carry out the elimination beforehand; note that the criterion for irrelevance is preset;

Transformation: data might need to be transformed (in the same way that a mathematical object could be represented in different coordinate systems) to make items in a merged database compatible with each other;

Data mining proper (as described above);¹¹

Pattern evaluation—human annotation of whatever emerges;

Visualization.
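The preparatory steps of this framework (cleansing, integration, selection, transformation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the two hypothetical laboratory data sets, their field names, the validity rules and the choice of a log2 scale as the common representation are invented, and only the automatable part of cleansing is shown (scrutiny of the laboratory methods remains a human task).

```python
import math

# Hypothetical records from two laboratories; names and values are invented.
lab_a = [
    {"gene": "g1", "expression": 8.0},
    {"gene": "g2", "expression": -1.0},   # internally inconsistent: negative abundance
    {"gene": "g3", "expression": 32.0},
]
lab_b = [
    {"id": "g4", "log2_expr": 4.0},
    {"id": "g5", "log2_expr": None},      # missing measurement
]


def cleanse(records, is_valid):
    """Automatable cleansing: drop records that fail an internal validity check."""
    return [r for r in records if is_valid(r)]


def integrate(a, b):
    """Integration: merge both sources into one common record format."""
    merged = [{"gene": r["gene"], "value": r["expression"], "scale": "raw"} for r in a]
    merged += [{"gene": r["id"], "value": r["log2_expr"], "scale": "log2"} for r in b]
    return merged


def select(records, wanted):
    """Selection: apply a preset relevance criterion before mining."""
    return [r for r in records if r["gene"] in wanted]


def transform(records):
    """Transformation: express every value on the log2 scale so items are comparable."""
    out = []
    for r in records:
        value = math.log2(r["value"]) if r["scale"] == "raw" else r["value"]
        out.append({"gene": r["gene"], "log2_value": value})
    return out


def prepare(a, b, wanted):
    """Run the preparatory steps in the order given in the text."""
    a = cleanse(a, lambda r: r["expression"] is not None and r["expression"] > 0)
    b = cleanse(b, lambda r: r["log2_expr"] is not None)
    return transform(select(integrate(a, b), wanted))


prepared = prepare(lab_a, lab_b, wanted={"g1", "g3", "g4"})
for record in prepared:
    print(record)
```

Note that the set passed as `wanted` plays the role of the preset criterion for irrelevance: it is fixed by the miner before the data are examined, not discovered by the mining itself.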

In the next section we look at a specific subset of data mining.
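The distinction drawn above between supervised and unsupervised learning can be made concrete with a small Python sketch on one-dimensional data. It is an illustration under stated assumptions, not a recommended algorithm: the supervised part is a nearest-centroid classifier with two predefined classes, the unsupervised part groups sorted values wherever the gap to the previous value does not exceed a threshold (a simple distance-based clustering), and all data, class names and the threshold `max_gap` are invented.

```python
from statistics import mean


def train_centroids(labelled):
    """Supervised: the classes are predefined by the labels of the training items."""
    by_class = {}
    for value, label in labelled:
        by_class.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_class.items()}


def classify(value, centroids):
    """Map an item onto the predefined class whose centroid is nearest."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def cluster(values, max_gap):
    """Unsupervised: start a new group whenever the gap to the previous value exceeds max_gap."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


# Hypothetical training items with predefined classes, and unlabelled items to mine.
labelled = [(1.0, "low"), (1.4, "low"), (5.0, "high"), (5.6, "high")]
unlabelled = [0.9, 1.3, 5.2, 9.8, 10.1]

centroids = train_centroids(labelled)
labels = [classify(v, centroids) for v in unlabelled]
groups = cluster(unlabelled, max_gap=1.0)

print(labels)
print(groups)
```

In this sketch the classifier can only ever answer “low” or “high”, so the two items near 10 are forced into “high”, whereas the clustering step reports them as a third group. This is the sense in which a miner who has predefined his classes may overlook minerals he was not seeking.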

Problem. Discuss the autonomy of the data mining process.

10 See footnote 30 in Chap. 6.

11 See Mabu et al. (2018), Table 1, or Deepthi et al. (2019) for overviews of data mining algorithms in bioinformatics.